Energy Bounds for Fault-Tolerant Nanoscale Designs
The problem of determining lower bounds for the energy cost of a given
nanoscale design is addressed via a complexity theory-based approach. This
paper provides a theoretical framework that is able to assess the trade-offs
existing in nanoscale designs between the amount of redundancy needed for a
given level of resilience to errors and the associated energy cost. Circuit
size, logic depth and error resilience are analyzed and brought together in a
theoretical framework that can be seamlessly integrated with automated
synthesis tools and can guide the design process of nanoscale systems composed
of failure-prone devices. The impact of redundancy addition on the switching
energy and its relationship with leakage energy is modeled in detail. Results
show that 99% error resilience is possible for fault-tolerant designs, but at
the expense of at least 40% more energy if individual gates fail independently
with a probability of 1%. Comment: Submitted on behalf of EDAA (http://www.edaa.com/)
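As a rough, hypothetical illustration of the redundancy-versus-resilience trade-off discussed above (not the paper's complexity-theoretic framework), the sketch below computes the resilience of simple majority voting over independent replicas at the 1% gate failure probability mentioned in the abstract; the replica count stands in for a naive energy overhead.

```python
from math import comb

def majority_vote_failure(p: float, n: int) -> float:
    """Probability that a strict majority of n independent replicas is wrong,
    given each replica fails independently with probability p."""
    assert n % 2 == 1, "use an odd replica count so a strict majority exists"
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

p = 0.01  # gate failure probability from the abstract
for n in (1, 3, 5):
    resilience = 1 - majority_vote_failure(p, n)
    # The paper models switching and leakage energy in detail; here the
    # overhead is naively proportional to the replica count.
    print(f"replicas={n}  resilience={resilience:.4%}  naive energy overhead={n}x")
```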
Hardware-Aware Machine Learning: Modeling and Optimization
Recent breakthroughs in Deep Learning (DL) applications have made DL models a
key component in almost every modern computing system. The increased popularity
of DL applications deployed on a wide spectrum of platforms has resulted in a
plethora of design challenges related to the constraints introduced by the
hardware itself. What is the latency or energy cost for an inference made by a
Deep Neural Network (DNN)? Is it possible to predict this latency or energy
consumption before a model is trained? If yes, how can machine learners take
advantage of these models to design the hardware-optimal DNN for deployment?
From lengthening battery life of mobile devices to reducing the runtime
requirements of DL models executing in the cloud, the answers to these
questions have drawn significant attention.
One cannot optimize what isn't properly modeled. Therefore, it is important
to understand the hardware efficiency of DL models during serving for making an
inference, before even training the model. This key observation has motivated
the use of predictive models to capture the hardware performance or energy
efficiency of DL applications. Furthermore, DL practitioners are challenged
with the task of designing the DNN model, i.e., of tuning the hyper-parameters
of the DNN architecture, while optimizing for both accuracy of the DL model and
its hardware efficiency. Therefore, state-of-the-art methodologies have
proposed hardware-aware hyper-parameter optimization techniques. In this paper,
we provide a comprehensive assessment of state-of-the-art work and selected
results on the hardware-aware modeling and optimization for DL applications. We
also highlight several open questions that are poised to give rise to novel
hardware-aware designs in the next few years, as DL applications continue to
significantly impact associated hardware systems and platforms. Comment: ICCAD'18 Invited Paper
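As a minimal sketch of the kind of hardware-cost predictor surveyed above, the snippet below fits a linear model that maps layer hyper-parameters to measured latency, so the cost of an unseen layer can be estimated before training. The feature layout and the profiling numbers are hypothetical placeholders, not data from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: (input_size, kernel_size, in_channels, out_channels, stride)
layer_configs = np.array([
    [224, 3, 3, 64, 2],
    [112, 3, 64, 128, 1],
    [56, 3, 128, 256, 2],
])
# Latencies would come from profiling on the target hardware (values invented).
measured_latency_ms = np.array([1.9, 3.4, 2.7])

model = LinearRegression().fit(layer_configs, measured_latency_ms)

# Predict the latency of an unseen layer before ever training the DNN that contains it.
candidate = np.array([[28, 3, 256, 512, 1]])
print(model.predict(candidate))
```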
Layer-compensated Pruning for Resource-constrained Convolutional Neural Networks
Resource-efficient convolutional neural networks enable not only
intelligence on edge devices but also opportunities in system-level
optimization such as scheduling. In this work, we aim to improve the
performance of resource-constrained filter pruning by merging two sub-problems
commonly considered, i.e., (i) how many filters to prune for each layer and
(ii) which filters to prune given a per-layer pruning budget, into a global
filter ranking problem. Our framework entails a novel algorithm, dubbed
layer-compensated pruning, where meta-learning is involved to determine better
solutions. We show empirically that the proposed algorithm is superior to prior
art in both effectiveness and efficiency. Specifically, we reduce the accuracy
gap between the pruned and original networks from 0.9% to 0.7% with an 8x
reduction in time needed for meta-learning, i.e., from 1 hour down to 7
minutes. We further demonstrate the effectiveness of our algorithm using
ResNet and MobileNetV2 networks on the CIFAR-10, ImageNet, and Bird-200
datasets. Comment: 11 pages, 8 figures, work in progress
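A minimal sketch of the global filter-ranking idea described above, assuming an L1-norm saliency per filter plus an additive per-layer compensation term (the quantity the paper determines with meta-learning); the function and its arguments are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

def global_filter_ranking(model: nn.Module, compensation: dict, sparsity: float):
    """Rank all filters of all conv layers on one global scale and mark the
    bottom `sparsity` fraction for pruning. `compensation` maps a layer name
    to a per-layer offset (assumed here to be learned elsewhere)."""
    scores = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # L1 norm of each output filter as a simple saliency proxy.
            saliency = module.weight.detach().abs().sum(dim=(1, 2, 3))
            for idx, s in enumerate(saliency):
                scores.append((s.item() + compensation.get(name, 0.0), name, idx))
    scores.sort(key=lambda t: t[0])
    n_prune = int(len(scores) * sparsity)
    return scores[:n_prune]  # (score, layer name, filter index) of filters to prune
```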
Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines
One of the most important problems faced by microarchitecture designers is the poor scalability of some of the current solutions with increased clock frequencies and wider pipelines. As several studies show, internal processor structures scale differently with decreasing device sizes. While in some cases the access latency is determined by the speed of the logic circuitry, for others it is dominated by the interconnect delay. Furthermore, while some stages can be super-pipelined with relatively small performance loss, others must be kept atomic. This paper proposes a possible solution to this problem, avoiding the traditional trade-off between parallelism and clock speed. First, allowing instructions to enter and leave the Issue Window in an asynchronous manner enables faster speeds in the front-end at the expense of small synchronization latencies. Second, using an Execution Cache for storing instructions that are already scheduled allows for bypassing the issue circuitry and thus clocking the execution core at higher frequencies. Combined, these two mechanisms result in a 50% to 60% performance increase for our test microarchitecture, without requiring a completely new scheduling mechanism. Furthermore, the proposed microarchitecture requires significantly less energy, with a 30% reduction in a 0.13um process technology or 20% in a 0.06um process technology over the original baseline.
Thread Progress Equalization: Dynamically Adaptive Power and Performance Optimization of Multi-threaded Applications
Dynamically adaptive multi-core architectures have been proposed as an
effective solution to optimize performance for peak power constrained
processors. In such processors, the micro-architectural parameters or
voltage/frequency of each core can be changed at run-time, thus providing a
range of power/performance operating points for each core. In this paper, we
propose Thread Progress Equalization (TPEq), a run-time mechanism for power
constrained performance maximization of multithreaded applications running on
dynamically adaptive multicore processors. Compared to existing approaches,
TPEq (i) identifies and addresses two primary sources of inter-thread
heterogeneity in multithreaded applications, (ii) determines the optimal core
configurations in polynomial time with respect to the number of cores and
configurations, and (iii) requires no modifications in the user-level source
code. Our experimental evaluations demonstrate that TPEq outperforms
state-of-the-art run-time power/performance optimization techniques proposed in
the literature for dynamically adaptive multicores by up to 23%.
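A hypothetical, greatly simplified sketch in the spirit of TPEq: under a chip-level power budget, repeatedly upgrade the core running the thread predicted to finish last. The configuration table, the greedy loop, and all numbers are illustrative assumptions; the actual mechanism models inter-thread heterogeneity and finds the optimal assignment in polynomial time.

```python
CONFIGS = [          # (relative speed, power in watts) per core configuration
    (1.0, 5.0),
    (1.4, 8.0),
    (1.8, 12.0),
]

def equalize(remaining_work, power_budget):
    """remaining_work[i]: predicted work left for thread i (arbitrary units).
    Greedily upgrade the core whose thread is predicted to finish last."""
    assignment = [0] * len(remaining_work)
    power = sum(CONFIGS[c][1] for c in assignment)
    while True:
        finish = [remaining_work[i] / CONFIGS[assignment[i]][0]
                  for i in range(len(remaining_work))]
        slowest = max(range(len(finish)), key=lambda i: finish[i])
        cfg = assignment[slowest]
        if cfg + 1 == len(CONFIGS):
            break                      # the critical thread is already maxed out
        extra = CONFIGS[cfg + 1][1] - CONFIGS[cfg][1]
        if power + extra > power_budget:
            break                      # no headroom left in the power budget
        assignment[slowest] = cfg + 1
        power += extra
    return assignment

print(equalize([0.30, 0.55, 0.42, 0.60], power_budget=30.0))
```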
Learning-based Application-Agnostic 3D NoC Design for Heterogeneous Manycore Systems
The rising use of deep learning and other big-data algorithms has led to an
increasing demand for hardware platforms that are computationally powerful, yet
energy-efficient. Due to the amount of data parallelism in these algorithms,
high-performance 3D manycore platforms that incorporate both CPUs and GPUs
present a promising direction. However, as systems use heterogeneity (e.g., a
combination of CPUs, GPUs, and accelerators) to improve performance and
efficiency, it becomes more pertinent to address the distinct and likely
conflicting communication requirements (e.g., CPU memory access latency or GPU
network throughput) that arise from such heterogeneity. Unfortunately, it is
difficult to quickly explore the hardware design space and choose appropriate
tradeoffs between these heterogeneous requirements. To address these
challenges, we propose the design of a 3D Network-on-Chip (NoC) for
heterogeneous manycore platforms that considers the appropriate design
objectives for a 3D heterogeneous system and explores various tradeoffs using
an efficient ML-based multi-objective optimization technique. The proposed
design space exploration considers the various requirements of its
heterogeneous components and generates a set of 3D NoC architectures that
efficiently trade off these design objectives. Our findings show that by
jointly considering these requirements (latency, throughput, temperature, and
energy), we can achieve 9.6% better Energy-Delay Product on average at nearly
iso-temperature conditions when compared to a thermally-optimized design for 3D
heterogeneous NoCs. More importantly, our results suggest that our 3D NoCs
optimized for a few applications can be generalized for unknown applications as
well. Our results show that these generalized 3D NoCs only incur a 1.8%
(36-tile system) and 1.1% (64-tile system) average performance loss compared to
application-specific NoCs. Comment: Published in IEEE Transactions on Computers
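The snippet below sketches only the multi-objective selection step implied above: keeping candidate NoC designs that are not dominated on latency, energy, peak temperature, and (negated) throughput. It is not the paper's ML-based optimizer, and the candidate tuples are made up.

```python
def dominates(a, b):
    """True if design a is at least as good as b on every objective and
    strictly better on at least one (all objectives are lower-is-better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

candidates = [
    # (latency, energy, peak temperature, -throughput) -- hypothetical values
    (12.0, 3.1, 78.0, -450.0),
    (10.5, 3.6, 82.0, -470.0),
    (12.5, 3.2, 79.0, -440.0),  # dominated by the first design
]
print(pareto_front(candidates))
```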
Task Scheduling for Heterogeneous Multicore Systems
In recent years, as the demand for low energy and high performance computing
has steadily increased, heterogeneous computing has emerged as an important and
promising solution. Because most workloads typically run most efficiently
on certain types of cores, mapping tasks onto the best available resources can
not only save energy but also deliver high performance. However, optimal task
scheduling for performance and/or energy is yet to be solved for heterogeneous
platforms. The work presented herein mathematically formulates the optimal
heterogeneous system task scheduling as an optimization problem using queueing
theory. We analytically solve for the common case of two processor types, e.g.,
CPU+GPU, and give an optimal policy (CAB). We design the GrIn heuristic to
efficiently solve for near-optimal policy for any number of processor types
(within 1.6% of the optimal). Both policies work for any task size distribution
and processing order, and are therefore general and practical. We extensively
simulate and validate the theory, and implement the proposed policy in a
CPU-GPU real platform to show the optimal throughput and energy improvement.
Compared to classic policies such as load balancing, our results range from
1.08x~2.24x better performance or 1.08x~2.26x better energy efficiency in
simulations, and 2.37x~9.07x better performance in experiments. Comment: heterogeneous
systems; scheduling; performance modeling; queueing theory
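As a loose, hypothetical illustration of queue-aware dispatch between two processor types, the sketch below sends each arriving task to the device with the smaller estimated completion time (queue length over service rate). It is a simple heuristic in the spirit of the abstract, not the paper's CAB or GrIn policies, and the service rates are invented.

```python
from collections import deque

SERVICE_RATE = {"cpu": 2.0, "gpu": 5.0}   # tasks per second, hypothetical
queues = {"cpu": deque(), "gpu": deque()}

def dispatch(task):
    # Estimated completion time if this task joins the given device's queue.
    def est_completion(dev):
        return (len(queues[dev]) + 1) / SERVICE_RATE[dev]
    target = min(queues, key=est_completion)
    queues[target].append(task)
    return target

for t in range(10):
    print(t, "->", dispatch(f"task-{t}"))
```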
NeuralPower: Predict and Deploy Energy-Efficient Convolutional Neural Networks
"How much energy is consumed for an inference made by a convolutional neural
network (CNN)?" With the increased popularity of CNNs deployed on the
wide-spectrum of platforms (from mobile devices to workstations), the answer to
this question has drawn significant attention. From lengthening battery life of
mobile devices to reducing the energy bill of a datacenter, it is important to
understand the energy efficiency of CNNs during serving for making an
inference, before actually training the model. In this work, we propose
NeuralPower: a layer-wise predictive framework based on sparse polynomial
regression, for predicting the serving energy consumption of a CNN deployed on
any GPU platform. Given the architecture of a CNN, NeuralPower provides an
accurate prediction and breakdown for power and runtime across all layers in
the whole network, helping machine learners quickly identify the power,
runtime, or energy bottlenecks. We also propose the "energy-precision ratio"
(EPR) metric to guide machine learners in selecting an energy-efficient CNN
architecture that better trades off the energy consumption and prediction
accuracy. The experimental results show that the prediction accuracy of the
proposed NeuralPower outperforms the best published model to date, yielding an
improvement in accuracy of up to 68.5%. We also assess the accuracy of
predictions at the network level, by predicting the runtime, power, and energy
of state-of-the-art CNN architectures, achieving an average accuracy of 88.24%
in runtime, 88.34% in power, and 97.21% in energy. We comprehensively
corroborate the effectiveness of NeuralPower as a powerful framework for
machine learners by testing it on different GPU platforms and Deep Learning
software tools. Comment: Accepted as a conference paper at ACML 2017
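A minimal sketch of the layer-wise sparse polynomial regression idea: expand layer hyper-parameters into polynomial terms and fit a Lasso model so that only a sparse subset of terms survives. The feature layout, profiling values, and hyper-parameters are assumptions, not NeuralPower's actual models.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Per conv layer: (batch, input_size, kernel_size, in_channels, out_channels)
X = np.array([
    [32, 224, 3, 3, 64],
    [32, 112, 3, 64, 128],
    [32, 56, 3, 128, 256],
    [32, 28, 3, 256, 512],
])
power_watts = np.array([41.0, 55.0, 63.0, 60.0])   # hypothetical GPU measurements

# Degree-2 polynomial expansion + L1 penalty keeps only a sparse set of terms.
model = make_pipeline(PolynomialFeatures(degree=2), Lasso(alpha=0.1, max_iter=50_000))
model.fit(X, power_watts)

# Network-level energy would be the sum over layers of predicted power times
# predicted runtime; only the power predictor is sketched here.
print(model.predict(np.array([[32, 14, 3, 512, 512]])))
```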
Towards Efficient Model Compression via Learned Global Ranking
Pruning convolutional filters has demonstrated its effectiveness in
compressing ConvNets. Prior art in filter pruning requires users to specify a
target model complexity (e.g., model size or FLOP count) for the resulting
architecture. However, determining a target model complexity can be difficult
for optimizing various embodied AI applications such as autonomous robots,
drones, and user-facing applications. First, both the accuracy and the speed of
ConvNets can affect the performance of the application. Second, the performance
of the application can be hard to assess without evaluating ConvNets during
inference. As a consequence, finding a sweet-spot between the accuracy and
speed via filter pruning, which needs to be done in a trial-and-error fashion,
can be time-consuming. This work takes a first step toward making this process
more efficient by altering the goal of model compression to producing a set of
ConvNets with various accuracy and latency trade-offs instead of producing one
ConvNet targeting some pre-defined latency constraint. To this end, we propose
to learn a global ranking of the filters across different layers of the
ConvNet, which is used to obtain a set of ConvNet architectures that have
different accuracy/latency trade-offs by pruning the bottom-ranked filters. Our
proposed algorithm, LeGR, is shown to be 2x to 3x faster than prior work while
having comparable or better performance when targeting seven pruned ResNet-56
architectures with different accuracy/FLOPs profiles on the CIFAR-100 dataset. Additionally,
we have evaluated LeGR on ImageNet and Bird-200 with ResNet-50 and MobileNetV2
to demonstrate its effectiveness. Code available at
https://github.com/cmu-enyac/LeGR. Comment: CVPR 2020 Oral
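A minimal sketch of the learned global ranking, assuming the per-layer affine transforms (a, b) are already given; LeGR's contribution is learning them, which is not shown here, and the function below is hypothetical rather than the released implementation.

```python
import torch
import torch.nn as nn

def global_rank(model: nn.Module, affine: dict):
    """Score every filter on one global scale using a per-layer affine
    transform of its norm: score = a * ||filter||_2 + b. The (a, b) pairs
    are assumed to be supplied (learned elsewhere)."""
    scored = []
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            a, b = affine.get(name, (1.0, 0.0))
            norms = m.weight.detach().flatten(1).norm(dim=1)   # per-filter L2 norm
            scored += [(a * n.item() + b, name, i) for i, n in enumerate(norms)]
    return sorted(scored)

# One ranking yields a whole family of pruned ConvNets: pruning the bottom 20%,
# 40%, ... of the same sorted list gives different accuracy/latency trade-offs.
```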
Regularizing Activation Distribution for Training Binarized Deep Networks
Binarized Neural Networks (BNNs) can significantly reduce the inference
latency and energy consumption in resource-constrained devices due to their
pure-logical computation and fewer memory accesses. However, training BNNs is
difficult since the activation flow encounters degeneration, saturation, and
gradient mismatch problems. Prior work alleviates these issues by increasing
activation bits and adding floating-point scaling factors, thereby sacrificing
BNN's energy efficiency. In this paper, we propose to use distribution loss to
explicitly regularize the activation flow, and develop a framework to
systematically formulate the loss. Our experiments show that the distribution
loss can consistently improve the accuracy of BNNs without losing their energy
benefits. Moreover, equipped with the proposed regularization, BNN training is
shown to be robust to the selection of hyper-parameters including optimizer and
learning rate.
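As a hedged sketch of regularizing the pre-binarization activation distribution, the penalty below discourages a channel's activations from collapsing to a single sign (degeneration) or drifting far past the binarization threshold (saturation). The paper derives its distribution loss systematically; this particular form and its weight are assumptions.

```python
import torch

def distribution_loss(pre_activation: torch.Tensor, saturation_limit: float = 3.0):
    """pre_activation: real-valued inputs to sign(), shape (batch, channels, ...)."""
    flat = pre_activation.flatten(2) if pre_activation.dim() > 2 else pre_activation.unsqueeze(2)
    mean = flat.mean(dim=(0, 2))                       # per-channel mean
    degeneration = mean.pow(2).mean()                  # push each channel's mean toward 0
    saturation = torch.relu(flat.abs() - saturation_limit).pow(2).mean()
    return degeneration + saturation

# Added to the task loss with a small weight, e.g.:
#   loss = criterion(logits, targets) + 0.1 * sum(distribution_loss(a) for a in pre_acts)
```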